Note: In this document, all R code is hidden by default to improve readability. You can reveal individual code chunks by clicking “Show” or display all code at once by clicking the “Code” button at the top right.
The purpose of this project is to demonstrate the implementation and
evaluation of vertical scaling for educational assessment. We’ll be
expanding on examples provided in the documentation for the plink
package.
This notebook will cover:
Vertical scaling is a statistical technique used in educational testing to create a common scale for measuring student performance across different grade levels or age groups. This approach allows educators and researchers to track and compare student progress over time, ensuring that test scores are interpretable and comparable across grades.
A few key concepts related to Vertical Scaling:
The two examples in this project use common-item vertical scaling of unidimensional item parameters. The first example is a simple two-group example, and the second is a more complex example spanning six grades (Grades 3 through 8). In a common-item scaling design, each grade takes a separate test, with adjacent grades sharing common items. The common items are used to form a linking chain. For instance, in our six-grade example, the lowest grade (Grade 3) serves as the base scale; the Grade 4 items are linked to the base scale using common items, the Grade 5 items are linked to the linked Grade 4 items, and so on.
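The linking chain can be sketched as repeated linear transformations. This is a minimal illustration, not part of the plink workflow; the slope (A) and intercept (B) values here are hypothetical placeholders.

```r
# Sketch: composing pairwise linking constants into a chain down to the
# Grade 3 base scale. A and B values are hypothetical placeholders.
chain_to_base <- function(theta, A, B) {
  # Apply each grade-to-grade transform in turn: theta <- A * theta + B
  for (i in seq_along(A)) theta <- A[i] * theta + B[i]
  theta
}
# Hypothetical links: Grade 5 -> Grade 4 (A = 1.05, B = -0.35),
# then Grade 4 -> Grade 3 (A = 1.10, B = -0.40)
chain_to_base(0, A = c(1.05, 1.10), B = c(-0.35, -0.40))  # -0.785
```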
There are other approaches to vertical scaling, such as a scaling test design. In this design, a ‘scaling test’ spanning the entire breadth of content for all grades is administered to a representative sample of students with participation from each grade. Then everyone in each grade is also given a ‘level test’ for their grade. This scaling test is used to place level test scores along the developmental scale.
For additional reading on Vertical Scaling, I’ll recommend:
First we’ll load the relevant packages, including custom package
CATFunctions to support the MLE estimation, and create a
function that employs parallel processing for efficient ability estimation
during simulation.
library(plink) # Links tests
library(psych) # Descriptive Stats
library(janitor) # Clean up data
library(tidyverse) # Keep things tidy
library(mirt) # IRT Modeling
library(parallel) # To run heavy est_abilities_parallel function
library(ggthemes) # for theme_clean()
library(purrr) # Functional programming
library(lme4) # Growth curve modeling
library(nlme) # For estimating linear models w/ MLE
library(CATFunctions) # devtools::install_github("scottfrohn/CATFunctions")
options(digits = 3)
# Parallel processing function
est_abilities_parallel <- function(
# Estimates abilities for all cases in a dataframe of item responses
response_df, # dataframe of item responses
as, # a params for each item in the response_df
bs, # b params for each item in the response_df
cs # c params for each item in the response_df
) {
# Split response_df into a list with one row of responses per case
responses_list = split(response_df, seq(nrow(response_df)))
# Use all but 1 core
num_cores <- detectCores() - 1
# Make cluster for the # of selected cores
cl <- makeCluster(num_cores)
# Export local variables
clusterExport(cl,
c("est_ability_mle", "prob", "check_lengths", "as", "bs", "cs"),
envir = environment())
# Estimate ability for all responses using all identified cores
abilities <- parLapply(cl, responses_list, function(responses) {
est_ability_mle(responses, as, bs, cs)
}
)
stopCluster(cl)
# Combine it all together
abilities <- do.call(bind_rows, lapply(abilities, function(x) {
data.frame(ability_est = x$ability_est, ability_est_se = x$ability_est_se)
}))
abilities
}
KB04 / Kolen - Brennan (2004): “This dataset includes three parameter logistic model (3PL) item parameter estimates for two forms of a unidimensional test for two groups. Both forms were calibrated separately. The Form X items are from a “new” form and the Form Y items are from an “old” form. The data are listed in Table 6.5 in Test Equating, Scaling, and Linking (2nd ed.). There are 12 common items between the two forms.” - Weeks (2010)
Note the same data are listed in the same table in Kolen and Brennan (2014) (the 3rd edition).
We can load the KB04 dataset from plink.
Terms I’ll use to describe the data herein are:
## discrimination difficulty asymptote
## 1 0.550 -1.7960 0.1751
## 2 0.789 -0.4796 0.1165
## 3 0.455 -0.7101 0.2087
## 4 1.444 0.4833 0.2826
## 5 0.974 -0.1680 0.2625
## 6 0.584 -0.8567 0.2038
## 7 0.860 0.4546 0.3224
## 8 1.145 -0.1301 0.2209
## 9 0.754 0.0212 0.1600
## 10 0.917 1.0139 0.3648
## 11 0.959 0.7218 0.2399
## 12 0.663 0.0506 0.1240
## 13 1.232 0.4167 0.2535
## 14 1.049 0.7882 0.1569
## 15 1.069 0.9610 0.2986
## 16 0.919 0.6099 0.2521
## 17 0.893 0.5128 0.2273
## 18 0.967 0.1950 0.0535
## 19 0.656 0.3953 0.1201
## 20 1.056 0.9481 0.2036
## 21 0.348 2.2768 0.1489
## 22 0.843 1.0601 0.2332
## 23 1.114 0.5826 0.0644
## 24 1.458 1.0241 0.2453
## 25 0.514 1.3790 0.1427
## 26 0.919 1.0782 0.0879
## 27 1.881 1.4062 0.1992
## 28 1.504 1.5093 0.1642
## 29 0.966 1.5443 0.1431
## 30 0.702 2.2401 0.0853
## 31 1.265 1.8759 0.2443
## 32 0.857 1.7140 0.0865
## 33 1.408 1.5556 0.0789
## 34 0.581 3.4728 0.1399
## 35 0.926 3.1202 0.1090
## 36 1.299 2.1589 0.1075
This is the unaltered KB04$pars$form.x table from our
dataset.
## discrimination difficulty asymptote
## 37 0.870 -1.4507 0.1576
## 38 0.463 -0.4070 0.1094
## 39 0.442 -1.3349 0.1559
## 40 0.545 -0.9017 0.1381
## 41 0.620 -1.4865 0.2114
## 42 0.573 -1.3210 0.1913
## 43 1.175 0.0691 0.2947
## 44 0.445 0.2324 0.2723
## 45 0.599 -0.7098 0.1177
## 46 0.848 -0.4253 0.1445
## 47 1.032 -0.8184 0.0936
## 48 0.604 -0.3539 0.0818
## 49 0.830 -0.0191 0.1283
## 50 0.725 -0.3155 0.0854
## 51 0.990 0.5320 0.3024
## 52 0.775 0.5394 0.2179
## 53 0.594 0.8987 0.2299
## 54 0.808 -0.1156 0.0648
## 55 0.964 -0.1948 0.1633
## 56 0.784 0.3506 0.1299
## 57 0.414 2.5538 0.2410
## 58 0.762 -0.1581 0.1137
## 59 1.196 0.5056 0.2397
## 60 1.355 0.5811 0.2243
## 61 1.187 0.6229 0.2577
## 62 1.030 0.3898 0.1856
## 63 1.042 0.9392 0.1651
## 64 1.206 1.1350 0.2323
## 65 0.970 0.6976 0.1070
## 66 0.634 1.8960 0.0794
## 67 1.082 1.3864 0.1855
## 68 1.020 0.9197 0.1027
## 69 1.135 1.0790 0.0630
## 70 1.195 1.8411 0.0999
## 71 1.196 2.0297 0.0832
## 72 0.925 2.1337 0.1259
This is the unaltered KB04$pars$form.y table from our
dataset. Note the rows are labeled 37:72 (instead of 1:36).
## form.x form.y
## [1,] 3 3
## [2,] 6 6
## [3,] 9 9
## [4,] 12 12
## [5,] 15 15
## [6,] 18 18
## [7,] 21 21
## [8,] 24 24
## [9,] 27 27
## [10,] 30 30
## [11,] 33 33
## [12,] 36 36
This is the unaltered KB04$common table from our dataset. Note the numbers under form.y reference the row index, not the row label. Therefore the ‘3’ here refers to the item labeled ‘39’ in the KB04$pars$form.y table. The scaling functions do not use the row labels, but rather the indices.
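To see the index-versus-label distinction concretely, here is a small stand-in table (only the first three form.y difficulties are reproduced):

```r
# Stand-in for KB04$pars$form.y: rows are *labeled* 37, 38, 39
# but *indexed* 1, 2, 3.
form_y <- data.frame(difficulty = c(-1.4507, -0.4070, -1.3349))
rownames(form_y) <- 37:39
form_y[3, "difficulty"]    # positional index 3: the row labeled "39"
form_y["39", "difficulty"] # the same row, selected by its label; both -1.3349
```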
Let’s take a look at the ICCs of our common items and see how they compare from the separate calibrations.
# Named vector to change the default item names generated by item_ICC_bank (row names)
recode_vector <- setNames(KB04$common[,"form.x"], c(1:12))
# Table that contains points for ICCs
common.pars <- KB04$pars$form.x[KB04$common[,"form.x"],] %>%
rename_with(., ~ c("a","b","c")) %>%
item_ICC_bank %>%
mutate(form = "x") %>%
bind_rows(
KB04$pars$form.y[KB04$common[,"form.y"],] %>%
rename_with(., ~ c("a","b","c")) %>%
item_ICC_bank %>%
mutate(form = "y")
) %>%
mutate(item = recode(item,
!!!recode_vector))
# Faceted ICC plots
common.pars %>%
ggplot(., aes(x = thetas, y = icc, color = form)) +
geom_line() +
facet_wrap(~ item) +
scale_color_manual(values = c("x" = "green3", "y" = "blue3")) +
labs(
title = "Item Characteristic Curves",
x = expression(theta),
y = "Probability of Correct Response"
) +
theme_clean() +
theme(legend.position = "bottom")
Item 27 stands out, performing considerably differently across the two groups. We could perform analyses to quantify the extent of this differential item functioning, but that’s beyond the scope of this project. We’ll simply assume item 27 demonstrates unacceptable DIF and leave it out of our scaling.
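To put a rough number on how far apart item 27’s two calibrations sit, we can evaluate the 3PL curves directly. This is a standalone check using parameters copied from the tables above (form.x row 27; form.y’s paired item is the row labeled 63), not part of the plink workflow.

```r
# 3PL response probability (D absorbed into a)
p3pl <- function(theta, a, b, c) c + (1 - c) / (1 + exp(-a * (theta - b)))

theta <- seq(-4, 4, 0.1)
icc_x <- p3pl(theta, 1.881, 1.4062, 0.1992)  # item 27, Form X calibration
icc_y <- p3pl(theta, 1.042, 0.9392, 0.1651)  # its pair on Form Y (row "63")
max(abs(icc_x - icc_y))  # largest vertical gap between the two curves
```

The maximum gap exceeds a tenth of a probability point, which is consistent with the visual impression from the faceted ICC plot.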
Before we conduct vertical scaling, there are a few decisions we need to make. For instance:
- What method of scaling (rescale) the IRT parameters should we use? There are a few ways, some using the means and standard deviations of common items (Mean/Mean "MM", Mean/Sigma "MS"), and some using item characteristic curve transformations (Haebara "HB", Stocking-Lord "SL"). Each of these methods will produce Slope (A) and Intercept (B) parameters to linearly transform item parameters from Form X to the scale of Form Y. Let’s just go with Mean/Sigma.
- What base group (base.grp) should we use? Let’s use Form Y (base.grp = 2) for the base scale, transforming Form X item parameters to the Form Y scale. The base group number is defined by the order in which the group appears in the irt.pars class object we pass into the model (for us it’s KB04$pars).
- What about the scaling constant (D)? Using a constant like D = 1.7 would transform our results to the normal ogive metric. Since the KB04 items were calibrated with BILOG-MG using the default 1.7 scaling constant, we’ll use that constant as well.
- Should we exclude any of the common items? Based on our evaluation of the common item ICCs above, let’s drop item 27 (exclude = list(27, NA)).
# Create irt.pars object with two groups (all dichotomous items)
kb04.pm <- as.poly.mod(36) # Create a poly.mod object with 36 items
# Default model = "drm" for dichotomous responses model
# All items are of the same model family (drm)
kb04.irt.pars <- as.irt.pars( # Turn values into an `irt.pars` object
x = KB04$pars, # List of item parameter matrices, one per group
common = KB04$common, # Matrix of common items between the groups
cat = list(rep(2,36),rep(2,36)), # Number of response categories for all 36
# items in the datasets; two response cats
poly.mod = list(kb04.pm, kb04.pm) # Lists the poly.mod objects that define
# the models for each item (all are the
# same, 36 'drm' items)
)
# Rescale the item parameters using the Mean/Sigma linking constants,
# and exclude item 27 from the common item set
kb04.out <- plink(kb04.irt.pars,
rescale = "MS", # Transform parameters to the base scale, using Mean/Sigma (Mean/SD) method
base.grp = 2, # The base scale is group 2 (form.y)
D = 1.7, # The scaling constant for the logistic model; D = 1.7 approximates the normal ogive metric
exclude = list(27,NA)) # Excludes item 27 from the linking process
# And get the new item parameters.
kb04.pars.out <- link.pars(kb04.out)
Now let’s examine a summary of the scaling.
## ------- group1/group2* -------
## Linking Constants
##
## A B
## Mean/Mean 1.144959 -0.478967
## Mean/Sigma 1.176074 -0.504188
## Haebara 1.088361 -0.441590
## Stocking-Lord 1.097363 -0.464035
##
## Common Item Descriptive Statistics
##
## Model: 3PL
## Number of Items: 11
##
## a b c
## N Pars: 11.000000 11.000000 11.000000
## Mean: To 0.770809 0.449127 0.149773
## Mean: From 0.882545 0.810591 0.155864
## SD: To 0.285828 1.293511 0.076785
## SD: From 0.366505 1.099855 0.072778
What this output tells us:
- The 11 common items were all 3PL calibrated (we had 12, but dropped item 27).
- The Slope (A) and Intercept (B) parameters for the different linear scaling methods are presented. Since we used rescale = 'MS', our parameters are in the row marked Mean/Sigma, which can be used to rescale the 3PL item parameters for Form X (since Form Y was our base form). These parameters are A = 1.176 and B = -0.504.
- Because these reflect the linear differences in the common items, we can loosely interpret the intercept (about -0.50) as indicating that the group that took Form X is, on average, about half a standard deviation lower in ability than the group that took Form Y.
The last table shows the descriptive stats of parameters for the
common items for Form Y (To) and Form X (From). Note the SD calculated
here is the population SD, so if you’re checking the calculations, the R
function sd() only reports sample SD and will be slightly
off.
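As a check on that note, the Mean/Sigma constants can be reproduced by hand from the “To” and “From” rows of the b column in the summary above, using the population SD:

```r
# Population SD: divide by N rather than N - 1, matching plink's summary
pop_sd <- function(x) sqrt(mean((x - mean(x))^2))

# Mean/Sigma method: A from the ratio of difficulty SDs, B from the means
ms_constants <- function(b_to, b_from) {
  A <- pop_sd(b_to) / pop_sd(b_from)
  c(A = A, B = mean(b_to) - A * mean(b_from))
}

# Plugging in the summary's reported means/SDs directly recovers its MS row:
A <- 1.293511 / 1.099855       # SD: To / SD: From
B <- 0.449127 - A * 0.810591   # Mean: To - A * Mean: From
round(c(A, B), 4)              # ~ 1.1761, -0.5042
```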
Let’s check those item parameters from the final scaling object
(kb04.out). Note that Form Y (group2)
parameters should be the same, and Form X (group1)
parameters will be different.
Since they shouldn’t have changed, let’s check Form Y first.
# Pull out the form y parameters
(form.y.params <- KB04$pars$form.y %>% # 1. Original parameters, "before" scaling
rename(a.before = "discrimination",
b.before = "difficulty",
c.before = "asymptote") %>%
bind_cols(kb04.pars.out$group2 %>% # 2. New parameters, "after" scaling"
as.data.frame() %>%
setNames(c("a.after","b.after","c.after"))) %>%
mutate(item = seq(1:nrow(kb04.pars.out$group2)),
common = ifelse(item %in% data.frame(kb04.out$pars@common)[,2],
1,0)) %>%
relocate(item, common, a.before, a.after, b.before, b.after, c.before, c.after)
)
## item common a.before a.after b.before b.after c.before c.after
## 37 1 0 0.870 0.870 -1.4507 -1.4507 0.1576 0.1576
## 38 2 0 0.463 0.463 -0.4070 -0.4070 0.1094 0.1094
## 39 3 1 0.442 0.442 -1.3349 -1.3349 0.1559 0.1559
## 40 4 0 0.545 0.545 -0.9017 -0.9017 0.1381 0.1381
## 41 5 0 0.620 0.620 -1.4865 -1.4865 0.2114 0.2114
## 42 6 1 0.573 0.573 -1.3210 -1.3210 0.1913 0.1913
## 43 7 0 1.175 1.175 0.0691 0.0691 0.2947 0.2947
## 44 8 0 0.445 0.445 0.2324 0.2324 0.2723 0.2723
## 45 9 1 0.599 0.599 -0.7098 -0.7098 0.1177 0.1177
## 46 10 0 0.848 0.848 -0.4253 -0.4253 0.1445 0.1445
## 47 11 0 1.032 1.032 -0.8184 -0.8184 0.0936 0.0936
## 48 12 1 0.604 0.604 -0.3539 -0.3539 0.0818 0.0818
## 49 13 0 0.830 0.830 -0.0191 -0.0191 0.1283 0.1283
## 50 14 0 0.725 0.725 -0.3155 -0.3155 0.0854 0.0854
## 51 15 1 0.990 0.990 0.5320 0.5320 0.3024 0.3024
## 52 16 0 0.775 0.775 0.5394 0.5394 0.2179 0.2179
## 53 17 0 0.594 0.594 0.8987 0.8987 0.2299 0.2299
## 54 18 1 0.808 0.808 -0.1156 -0.1156 0.0648 0.0648
## 55 19 0 0.964 0.964 -0.1948 -0.1948 0.1633 0.1633
## 56 20 0 0.784 0.784 0.3506 0.3506 0.1299 0.1299
## 57 21 1 0.414 0.414 2.5538 2.5538 0.2410 0.2410
## 58 22 0 0.762 0.762 -0.1581 -0.1581 0.1137 0.1137
## 59 23 0 1.196 1.196 0.5056 0.5056 0.2397 0.2397
## 60 24 1 1.355 1.355 0.5811 0.5811 0.2243 0.2243
## 61 25 0 1.187 1.187 0.6229 0.6229 0.2577 0.2577
## 62 26 0 1.030 1.030 0.3898 0.3898 0.1856 0.1856
## 63 27 0 1.042 1.042 0.9392 0.9392 0.1651 0.1651
## 64 28 0 1.206 1.206 1.1350 1.1350 0.2323 0.2323
## 65 29 0 0.970 0.970 0.6976 0.6976 0.1070 0.1070
## 66 30 1 0.634 0.634 1.8960 1.8960 0.0794 0.0794
## 67 31 0 1.082 1.082 1.3864 1.3864 0.1855 0.1855
## 68 32 0 1.020 1.020 0.9197 0.9197 0.1027 0.1027
## 69 33 1 1.135 1.135 1.0790 1.0790 0.0630 0.0630
## 70 34 0 1.195 1.195 1.8411 1.8411 0.0999 0.0999
## 71 35 0 1.196 1.196 2.0297 2.0297 0.0832 0.0832
## 72 36 1 0.925 0.925 2.1337 2.1337 0.1259 0.1259
# Check the number of items that have different before and after parameters
form.y.params %>%
mutate(a.diff = a.before - a.after, # Difference between the params before and after scaling
b.diff = b.before - b.after,
c.diff = c.before - c.after) %>%
select(a.diff, b.diff, c.diff) %>%
summarise(a_differences = sum(a.diff != 0),
b_differences = sum(b.diff != 0),
c_differences = sum(c.diff != 0)) %>%
t() %>% data.frame("N.items" = .)
## N.items
## a_differences 0
## b_differences 0
## c_differences 0
The first table shows a side-by-side comparison of parameters
‘before’ and ‘after’ scaling for each item. Looks like nothing has
changed (which is what we expect). The second (custom) table provides a
summary of the number of different parameters, which we can confirm is 0
for a, b, and c. Looks good,
let’s check Form X now.
So let’s use those linear scaling parameters to manually rescale the
original Form X parameters and compare against the output from
plink.
Note that the parameter transformation equations differ by parameter: the discrimination is divided by the slope (a* = a / A), the difficulty gets the full linear transform (b* = A * b + B), and the lower asymptote is unchanged (c* = c).
# Get the slope and intercept from the scaling output. If we wanted to, we could change the '$MS' to '$MM', '$HB', or '$SL' to access the other linear parameters for the other 3 scaling methods.
x.slope <- kb04.out$link@constants$MS[1]
x.intercept <- kb04.out$link@constants$MS[2]
(form.x.params <- KB04$pars$form.x %>%
rename(a.before = "discrimination",
b.before = "difficulty",
c.before = "asymptote") %>%
bind_cols(kb04.pars.out$group1 %>%
as.data.frame() %>%
setNames(c("a.after","b.after","c.after"))) %>%
mutate(item = seq(1:nrow(kb04.pars.out$group1)),
common = ifelse(item %in% data.frame(kb04.out$pars@common)[,1],
1,0)) %>%
relocate(item, common, a.before, a.after, b.before, b.after, c.before, c.after)
)
## item common a.before a.after b.before b.after c.before c.after
## 1 1 0 0.550 0.467 -1.7960 -2.6164 0.1751 0.1751
## 2 2 0 0.789 0.671 -0.4796 -1.0682 0.1165 0.1165
## 3 3 1 0.455 0.387 -0.7101 -1.3393 0.2087 0.2087
## 4 4 0 1.444 1.228 0.4833 0.0642 0.2826 0.2826
## 5 5 0 0.974 0.828 -0.1680 -0.7018 0.2625 0.2625
## 6 6 1 0.584 0.496 -0.8567 -1.5117 0.2038 0.2038
## 7 7 0 0.860 0.732 0.4546 0.0305 0.3224 0.3224
## 8 8 0 1.145 0.973 -0.1301 -0.6572 0.2209 0.2209
## 9 9 1 0.754 0.641 0.0212 -0.4793 0.1600 0.1600
## 10 10 0 0.917 0.780 1.0139 0.6882 0.3648 0.3648
## 11 11 0 0.959 0.816 0.7218 0.3447 0.2399 0.2399
## 12 12 1 0.663 0.564 0.0506 -0.4447 0.1240 0.1240
## 13 13 0 1.232 1.048 0.4167 -0.0141 0.2535 0.2535
## 14 14 0 1.049 0.892 0.7882 0.4228 0.1569 0.1569
## 15 15 1 1.069 0.909 0.9610 0.6260 0.2986 0.2986
## 16 16 0 0.919 0.782 0.6099 0.2131 0.2521 0.2521
## 17 17 0 0.893 0.760 0.5128 0.0989 0.2273 0.2273
## 18 18 1 0.967 0.822 0.1950 -0.2749 0.0535 0.0535
## 19 19 0 0.656 0.558 0.3953 -0.0393 0.1201 0.1201
## 20 20 0 1.056 0.898 0.9481 0.6108 0.2036 0.2036
## 21 21 1 0.348 0.296 2.2768 2.1735 0.1489 0.1489
## 22 22 0 0.843 0.717 1.0601 0.7426 0.2332 0.2332
## 23 23 0 1.114 0.947 0.5826 0.1810 0.0644 0.0644
## 24 24 1 1.458 1.240 1.0241 0.7002 0.2453 0.2453
## 25 25 0 0.514 0.437 1.3790 1.1176 0.1427 0.1427
## 26 26 0 0.919 0.782 1.0782 0.7639 0.0879 0.0879
## 27 27 0 1.881 1.599 1.4062 1.1496 0.1992 0.1992
## 28 28 0 1.504 1.279 1.5093 1.2709 0.1642 0.1642
## 29 29 0 0.966 0.822 1.5443 1.3120 0.1431 0.1431
## 30 30 1 0.702 0.597 2.2401 2.1303 0.0853 0.0853
## 31 31 0 1.265 1.076 1.8759 1.7020 0.2443 0.2443
## 32 32 0 0.857 0.728 1.7140 1.5116 0.0865 0.0865
## 33 33 1 1.408 1.197 1.5556 1.3253 0.0789 0.0789
## 34 34 0 0.581 0.494 3.4728 3.5801 0.1399 0.1399
## 35 35 0 0.926 0.787 3.1202 3.1654 0.1090 0.1090
## 36 36 1 1.299 1.105 2.1589 2.0348 0.1075 0.1075
form.x.params %>%
mutate(a.calc = a.before / x.slope,
b.calc = b.before * x.slope + x.intercept,
c.calc = c.before,
a.diff = a.before - a.after, # Difference between the params before and after scaling
b.diff = b.before - b.after,
c.diff = c.before - c.after,
a.calc.diff = a.calc - a.after,
b.calc.diff = b.calc - b.after,
c.calc.diff = c.calc - c.after) %>%
select(a.diff, b.diff, c.diff, a.calc.diff, b.calc.diff, c.calc.diff) %>%
summarise(a_differences = sum(a.diff != 0),
b_differences = sum(b.diff != 0),
c_differences = sum(c.diff != 0),
a_calc_diffs = sum(a.calc.diff != 0),
b_calc_diffs = sum(b.calc.diff != 0),
c_calc_diffs = sum(c.calc.diff != 0)) %>%
t() %>% data.frame("N.items" = .)
## N.items
## a_differences 36
## b_differences 36
## c_differences 0
## a_calc_diffs 0
## b_calc_diffs 0
## c_calc_diffs 0
Again, the first table shows a side-by-side comparison of parameters
‘before’ and ‘after’ scaling for each item. Looks like the
a’s and b’s have changed, but c’s
have not (as expected).
The second table shows that:
- All 36 a parameters are different after scaling.
- All 36 b parameters are different after scaling.
- None of the c parameters are different after scaling.
- None of our manually calculated a, b, and c parameters differ from the plink-scaled ones. Our calculated parameters using the Mean/Sigma values produce the same results.
From this point, we can use the updated item parameters and linear scaling constants for our vertical scale moving forward.
Now that we’ve confirmed the item parameter scaling results, let’s simulate a sample of 1000 latent abilities for two groups and check the impact of scaling. We’ll simulate with a normal distribution for both groups.
# 1. Simulate ability levels
n_test_takers <- 1000
# Group x is lower ability (this will be evident when creating response patterns, no need to change the distribution)
set.seed(123)
theta.x <- rnorm(n_test_takers)
# Group y is higher ability
set.seed(234)
theta.y <- rnorm(n_test_takers)
describe(theta.x) %>%
bind_rows(describe(theta.y)) %>%
as_tibble %>%
mutate(form = c("x","y")) %>%
relocate(form)
## # A tibble: 2 × 14
## form vars n mean sd median trimmed mad min max range
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 x 1 1000 0.0161 0.992 0.00921 0.00896 0.964 -2.81 3.24 6.05
## 2 y 1 1000 -0.0216 0.985 0.0257 -0.00949 0.957 -3.04 3.10 6.13
## # ℹ 3 more variables: skew <dbl>, kurtosis <dbl>, se <dbl>
Here are descriptive statistics of the simulated abilities for Group X and Group Y. Note that although the group means are the same, our vertical scaling will ultimately show that Group Y has higher ability than Group X, so we don’t need to build that mean difference into the simulated thetas. This is because the group difference is carried by the item parameters: we generate Form X responses from the original Form X parameters, and re-estimating abilities with the rescaled parameters places the Form X group lower on the common Form Y scale.
Now that we have an underlying distribution of latent abilities, we
can simulate response patterns for these simulees. The
CATFunctions::generate_responses function is designed to
simulate item responses based on the 3PL IRT model. It calculates the
probability of a sim with a given theta to get an item correct, then
simulates a response for that item using the binomial distribution
(using rbinom()). The probability \(P_i(\theta)\) of a person with ability
\(\theta\) getting item \(i\) correct, given item parameters \(a_i\), \(b_i\), and \(c_i\) is as follows:
\[ P_i(\theta) = c_i + (1 - c_i) \frac{1}{1 + \exp(-a_i (\theta - b_i))} \] Where \(a_i\) is the discrimination, \(b_i\) the difficulty, and \(c_i\) the lower asymptote (pseudo-guessing) parameter of item \(i\).
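A direct transcription of this formula, evaluated at a couple of illustrative parameter values:

```r
# 3PL probability of a correct response (D absorbed into a, as in the
# formula above)
p3pl <- function(theta, a, b, c) c + (1 - c) / (1 + exp(-a * (theta - b)))

p3pl(theta = 0, a = 1, b = 0, c = 0.2)  # at theta = b: 0.2 + 0.8/2 = 0.6
p3pl(theta = 3, a = 1, b = 0, c = 0.2)  # high ability: close to 1
```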
Note that for Form X, we are simulating response patterns based on the old item parameters. Then to simulate abilities, we’ll use the new item parameters.
set.seed(345) # for rbinom
generate_responses <- function(params, theta) {
a <- params$discrimination
b <- params$difficulty
c <- params$asymptote
n_items <- length(a)
responses <- matrix(0, nrow = length(theta), ncol = n_items)
for (i in 1:length(theta)) {
for (j in 1:n_items) {
prob <- c[j] + (1 - c[j]) / (1 + exp(-a[j] * (theta[i] - b[j])))
responses[i, j] <- rbinom(1, 1, prob)
}
}
return(responses)
}
# Generate responses for both forms
responses.x <- generate_responses(KB04$pars$form.x, theta.x)
responses.y <- generate_responses(KB04$pars$form.y, theta.y)
# Descriptive stats of sum scores
describe(rowSums(responses.x)) %>%
bind_rows(describe(rowSums(responses.y))) %>%
as_tibble() %>%
mutate(group = c("x","y")) %>%
relocate(group)
## # A tibble: 2 × 14
## group vars n mean sd median trimmed mad min max range skew
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 x 1 1000 16.7 5.38 16 16.5 4.45 3 33 30 0.399
## 2 y 1 1000 18.9 5.45 19 18.9 5.93 5 35 30 0.122
## # ℹ 2 more variables: kurtosis <dbl>, se <dbl>
This table presents descriptive statistics of sum scores. Since we’ll use MLE with the IRT item parameters to estimate abilities, take these with a grain of salt.
Now let’s estimate the abilities, standard error, and percentile rank for our test-takers.
x.ability.old <- est_abilities_parallel(responses.x
, as = KB04$pars$form.x$discrimination
, bs = KB04$pars$form.x$difficulty
, cs = KB04$pars$form.x$asymptote
) %>%
mutate(pct_rank = percent_rank(ability_est))
x.ability.new <- est_abilities_parallel(responses.x
, as = kb04.pars.out$group1[,1]
, bs = kb04.pars.out$group1[,2]
, cs = kb04.pars.out$group1[,3]
) %>%
mutate(pct_rank = percent_rank(ability_est))
y.ability <- est_abilities_parallel(responses.y
, as = kb04.pars.out$group2[,1]
, bs = kb04.pars.out$group2[,2]
, cs = kb04.pars.out$group2[,3]
) %>%
mutate(pct_rank = percent_rank(ability_est))
# Combine into single DF
kb04.abilities <- x.ability.old %>%
mutate(group = "x.old",
ability_true = theta.x) %>%
bind_rows(x.ability.new %>%
mutate(group = "x.new",
ability_true = theta.x)) %>%
bind_rows(y.ability %>%
mutate(group = "y",
ability_true = theta.y)) %>%
relocate(ability_true) %>%
mutate(group = factor(group, levels = c("x.old","x.new","y")))
lapply(list(kb04.abilities$ability_est[kb04.abilities$group=="x.old"],
kb04.abilities$ability_est[kb04.abilities$group=="x.new"],
kb04.abilities$ability_est[kb04.abilities$group=="y"]),
describe) %>%
bind_rows %>%
as.data.frame() %>%
mutate(group = c("x.old","x.new","y")) %>%
as_tibble %>%
relocate(group, n, mean, sd, min, max, range, median)
## # A tibble: 3 × 14
## group n mean sd min max range median vars trimmed mad
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 x.old 1000 -0.109 1.52 -10.5 3.44 13.9 0.0543 1 0.0132 1.08
## 2 x.new 1000 -0.626 1.75 -12.1 3.55 15.7 -0.440 1 -0.489 1.27
## 3 y 1000 -0.0491 1.21 -5.78 4.71 10.5 0.0439 1 -0.0145 1.12
## # ℹ 3 more variables: skew <dbl>, kurtosis <dbl>, se <dbl>
This table shows descriptive statistics for Groups X and Y:
- x.old are the simulated Form X response data, scored with the original Form X item parameters
- x.new are the simulated Form X response data, scored with the new (scaled) Form X item parameters
- y are the simulated Form Y response data, scored with the Form Y item parameters
As you can see, the x.old and y groups have similar distributions of scores, which makes sense: the abilities were simulated from the same normal distribution, so response patterns generated from them should yield similar estimated ability distributions. Once we rescaled the Form X item parameters (x.new) and estimated abilities from the same response patterns as before, the distribution shifts: the mean drops (from about -0.11 to -0.63) and the spread increases (SD from about 1.52 to 1.75).
Now let’s visualize our testing data, starting with score distributions.
Code to create plots…
raw.est.plot <- data.frame(raw = rowSums(responses.x),
mle = kb04.abilities$ability_est[kb04.abilities$group=="x.old"],
group = "x.old") %>%
bind_rows(
data.frame(raw = rowSums(responses.x),
mle = kb04.abilities$ability_est[kb04.abilities$group=="x.new"],
group = "x.new")
) %>%
bind_rows(
data.frame(raw = rowSums(responses.y),
mle = kb04.abilities$ability_est[kb04.abilities$group=="y"],
group = "y")
) %>%
ggplot(., aes(x = raw, y = mle, color = group)) +
facet_wrap(~group) +
geom_point(alpha = 0.3, size = 1) +
labs(title = "Raw Scores v. Scaled MLE Estimated Ability by Group",
x = "Raw Score",
y = "Scaled Ability Estimate",
color = "Group") +
theme_clean() +
theme(legend.position.inside = c(0.95, 0.05),
legend.position = "inside",
legend.justification = c("right", "bottom")) # Adjust justification
ability.plot <- kb04.abilities %>%
filter(abs(ability_est) <= 6.0) %>%
ggplot(., aes(x = ability_est, color = as.factor(group), fill = as.factor(group))) +
geom_density(alpha = 0.3) + # Overlayed density plots with some transparency
labs(title = "Overlayed Density Plot of Ability Estimates by Group",
x = "Theta",
y = "Density",
color = "Group",
fill = "Group") +
scale_x_continuous(limits = c(-5, 5)) +
scale_y_continuous(limits = c(0, 0.6)) +
theme_clean() +
theme(legend.position.inside = c(0.05, 0.95),
legend.position = "inside",
legend.justification = c("left", "top")) # Adjust justification
ability.pct.plot <- kb04.abilities %>%
ggplot(., aes(x = ability_est, y = pct_rank, color = group)) +
geom_line() +
scale_x_continuous(limits = c(-5,5)) +
labs(title = "Ability and Percentile Rank by Group",
x = "Theta",
y = "Percentile Rank",
color = "Group") +
theme_clean() +
theme(legend.position.inside = c(0.05, 0.95),
legend.position = "inside",
legend.justification = c("left", "top")) # Adjust justification
se.plot <- kb04.abilities %>%
filter(abs(ability_est) <= 6.0) %>%
ggplot(., aes(x = ability_est, y = ability_est_se, color = group)) +
geom_jitter(alpha = 0.3) +
geom_point(alpha = 0.3, size = 1) +
scale_x_continuous(limits = c(-5, 5)) +
scale_y_continuous(limits = c(0, NA)) +
labs(title = "SE of Ability Estimate by Group",
x = "Theta",
y = "Standard Error (SE)",
color = "Group",
fill = "Group") +
theme_clean() +
theme(legend.position.inside = c(0.05, 0.95),
legend.position = "inside",
legend.justification = c("left", "top")) # Adjust justification
As we can see, there is a similar relationship between Raw Score and Scaled Ability Estimate for both groups.
Looking at the ability density, we can see that x.old
and y have similar distributions, but after vertical
scaling, the new Group X scores are more spread and shifted left.
Code to create plots…
# Create test information dataframe for plotting
kb04.ti <- test_info(thetas = seq(-4,4,.1),
a = KB04$pars$form.x$discrimination, # Original X
b = KB04$pars$form.x$difficulty,
c = KB04$pars$form.x$asymptote
) %>%
rename(x.old.info = "info", x.old.se = "se") %>%
bind_cols(
test_info(thetas = seq(-4,4,.1),
a = kb04.pars.out$group1[,1], # Scaled X
b = kb04.pars.out$group1[,2],
c = kb04.pars.out$group1[,3]
) %>%
rename(x.new.info = "info", x.new.se = "se") %>%
select(-theta)
) %>%
bind_cols(
test_info(thetas = seq(-4,4,.1),
a = kb04.pars.out$group2[,1], # Y
b = kb04.pars.out$group2[,2],
c = kb04.pars.out$group2[,3]
) %>%
rename(y.info = "info", y.se = "se") %>%
select(-theta)
)
# Create the combined plot
combined_plot.x.old <- kb04.abilities %>%
filter(group == "x.old") %>%
ggplot(aes(x = ability_est)) +
# Density plot
geom_density(alpha = 0.1, fill = "red", color = "red") +
# Test information function
geom_line(data = kb04.ti, aes(x = theta,
y = x.old.info / max(x.old.info) * max(density(kb04.abilities$ability_est[kb04.abilities$group == "x.old"])$y)),
color = "purple", linewidth = 0.5) +
# Standard error
geom_line(data = kb04.ti, aes(x = theta,
y = x.old.se / max(x.old.se) * max(density(kb04.abilities$ability_est[kb04.abilities$group == "x.old"])$y)),
color = "orange3", linewidth = 0.5) +
# Labels and theme
labs(title = "Original Form X Ability Estimate Density Plot with Test Info and SE",
x = "Theta",
y = "Density of Ability Estimates") +
theme_clean() +
scale_x_continuous(limits = c(-6,6)) +
# Add a secondary y-axis for Test Information and SE
scale_y_continuous(
limits = c(0, 0.6),
sec.axis = sec_axis(~ . * 10,
name = "Test Information / Standard Error")
) +
theme(plot.title = element_text(size = 12)) # Adjust title size
# Create the combined plot
combined_plot.x.new <- kb04.abilities %>%
filter(group == "x.new") %>%
ggplot(aes(x = ability_est)) +
# Density plot
geom_density(alpha = 0.1, fill = "green", color = "green") +
# Test information function
geom_line(data = kb04.ti, aes(x = theta,
y = x.new.info / max(x.new.info) * max(density(kb04.abilities$ability_est[kb04.abilities$group == "x.new"])$y)),
color = "purple", linewidth = 0.5) +
# Standard error
geom_line(data = kb04.ti, aes(x = theta,
y = x.new.se / max(x.new.se) * max(density(kb04.abilities$ability_est[kb04.abilities$group == "x.new"])$y)),
color = "orange3", linewidth = 0.5) +
# Labels and theme
labs(title = "Scaled Form X Ability Estimate Density Plot with Test Info and SE",
x = "Theta",
y = "Density of Ability Estimates") +
theme_clean() +
scale_x_continuous(limits = c(-6,6)) +
# Add a secondary y-axis for Test Information and SE
scale_y_continuous(
limits = c(0, 0.6),
sec.axis = sec_axis(~ . * 10,
name = "Test Information / Standard Error")
) +
theme(plot.title = element_text(size = 12)) # Adjust title size
combined_plot.y <- kb04.abilities %>%
filter(group == "y") %>%
ggplot(aes(x = ability_est)) +
# Density plot
geom_density(alpha = 0.1, fill = "blue", color = "blue") +
# Test information function
geom_line(data = kb04.ti, aes(x = theta,
y = y.info / max(y.info) * max(density(kb04.abilities$ability_est[kb04.abilities$group == "y"])$y)),
color = "purple", linewidth = 0.5) +
# Standard error
geom_line(data = kb04.ti, aes(x = theta, y =
y.se / max(y.se) * max(density(kb04.abilities$ability_est[kb04.abilities$group == "y"])$y)),
color = "orange3", linewidth = 0.5) +
# Labels and theme
labs(title = "Form Y Ability Estimate Density Plot with Test Info and SE",
x = "Theta",
y = "Density of Ability Estimates") +
theme_clean() +
scale_x_continuous(limits = c(-6,6)) +
# Add a secondary y-axis for Test Information and SE
scale_y_continuous(
limits = c(0, 0.6),
sec.axis = sec_axis(~ . * 10,
name = "Test Information / Standard Error")
) +
theme(plot.title = element_text(size = 12)) # Adjust title size
Note: the purple line represents Test Information, and the orange line represents the Test Standard Error.
Now that we’ve performed vertical scaling for our sample and estimated abilities for our test takers, we should convert the ability estimates (in logits) to another scale. This process is called scaling, and it aids score interpretation and tracking for our audience.
Let’s look at the descriptive statistics of abilities of our two groups.
psych::describe(x.ability.new$ability_est) %>% select(n, mean, sd, median, min, max, range) %>%
bind_rows(
psych::describe(y.ability$ability_est) %>% select(n, mean, sd, median, min, max, range)
) %>% as_tibble %>%
mutate(group = c("x.new","y")) %>% relocate(group)
## # A tibble: 2 × 8
## group n mean sd median min max range
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 x.new 1000 -0.626 1.75 -0.440 -12.1 3.55 15.7
## 2 y 1000 -0.0491 1.21 0.0439 -5.78 4.71 10.5
We have quite the range here, with some extreme values for the x.new group (min = -12.11), and even for the y group (min = -5.78). There are both linear and non-linear methods we could employ to convert these ability estimates to another scale. Let’s look at linear methods first.
Let’s assume we want our transformed scale to have the following limits: a minimum scaled score of 200 and a maximum of 800.
In this scaling option, we would use the min and max values from our entire dataset (across all forms), and basically draw a line between the min (-12.11, 200) and max (4.71, 800) and all scores in between would be linearly transformed to that scale. Let’s do that:
The formula for this is: \[S = A \cdot \theta + B\] Where \(S\) is the scaled score, \(\theta\) is the ability estimate, \(A\) is the slope, and \(B\) is the intercept.
And the formula for computing A and B: \[A = \frac{S_{\text{max}} - S_{\text{min}}}{\theta_{\text{max}} - \theta_{\text{min}}}\] \[B = S_{\text{min}} - A \cdot \theta_{\text{min}}\]
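As a quick worked check of these formulas, here's what A and B come out to using the min and max theta values from the descriptives above (-12.11 and 4.71):

```r
# Worked example of the A (slope) and B (intercept) formulas,
# using the observed min/max thetas reported in the descriptives above
theta.min <- -12.11
theta.max <- 4.71
A <- (800 - 200) / (theta.max - theta.min)  # ~35.7
B <- 200 - A * theta.min                    # ~632
# The endpoints map exactly onto the scale limits
A * c(theta.min, theta.max) + B  # 200 800
```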
Let’s compute that and create a scaled.scores dataframe.
scaled.scores <- kb04.abilities %>%
filter(group %in% c("x.new","y"))
A <- (800 - 200) / (max(scaled.scores$ability_est) - min(scaled.scores$ability_est))
B <- 200 - A * (min(scaled.scores$ability_est))
scaled.scores <- scaled.scores %>%
mutate(scale_linear = ceiling(A * ability_est + B))
# Plot
scaled.scores %>%
ggplot(., aes(x = ability_est, y = scale_linear)) +
geom_point(alpha = 0.2, color = "blue3") +
labs(title = "Linear Scaling using Scale Limits",
x = "Theta",
y = "Scaled Score") +
theme_clean()
This plot presents the linear scaling results for all sims (Groups X and Y). It also highlights the outliers at the low end of the theta scale: values below -5 cover scaled scores from roughly 200 to 450. In that sense, this method wastes a large portion of our scale on a few extreme outliers, while the scaled scores in the middle of the distribution are relatively compressed. Let’s break our sample into quintiles and look at the scale range for each group.
scale_breakdown <- function(df, varb) {
scaling <- function(df, varb) {
df %>%
summarise(
scale_min = min({{varb}}, na.rm = TRUE),
scale_max = max({{varb}}, na.rm = TRUE),
scale_range = scale_max - scale_min
)
}
result <- bind_rows(
df %>% scaling({{varb}}),
df %>% filter(pct_rank < 0.20) %>% scaling({{varb}}),
df %>% filter(pct_rank >= 0.20 & pct_rank < 0.40) %>% scaling({{varb}}),
df %>% filter(pct_rank >= 0.40 & pct_rank < 0.60) %>% scaling({{varb}}),
df %>% filter(pct_rank >= 0.60 & pct_rank < 0.80) %>% scaling({{varb}}),
df %>% filter(pct_rank >= 0.80) %>% scaling({{varb}})
) %>%
mutate(group = c("Total", "0%-20%", "20%-40%", "40%-60%", "60%-80%", "80%-100%")) %>%
relocate(group)
return(result)
}
scaled.scores %>% scale_breakdown(scale_linear)
## group scale_min scale_max scale_range
## 1 Total 200 800 600
## 2 0%-20% 200 598 398
## 3 20%-40% 575 623 48
## 4 40%-60% 605 643 38
## 5 60%-80% 627 664 37
## 6 80%-100% 655 800 145
This shows that the scale range for the bottom 20% of our sample spans 398 scaled points (nearly 2/3 of our range!). Not only is that a lot of dedicated scale range for a mere fifth of our sample, but examining our Info and SE plots, we have more error than information below about -1.5 logits, meaning we don’t have much score precision for MLE estimates in that range anyway. Conversely, the middle fifth of our sims (40% - 60%) is separated by a mere 38 points. That’s not much separation for a group of test-takers for which we do have a lot of measurement precision.
Let’s consider another linear approach that provides better separation in the middle of the distribution.
One thing we could do to address this is create a piecewise linear scale, where we select min and max theta values to anchor the scale, and any scores below or above those thresholds are assigned the minimum or maximum scaled score, respectively.
Here’s how that would look if we set our min(theta) and max(theta) to -4 and 4, respectively.
A.piece <- (800 - 200) / (4 - (-4))
B.piece <- 200 - A.piece * (-4)
scaled.scores <- scaled.scores %>%
mutate(scale_linear_piece = ceiling(A.piece * ability_est + B.piece),
scale_linear_piece = ifelse(scale_linear_piece < 200, 200,
ifelse(scale_linear_piece > 800, 800, scale_linear_piece)))
scaled.scores %>%
ggplot(aes(x = ability_est)) +
geom_point(aes(y = scale_linear, color = "Limits"), alpha = 0.2) +
geom_point(aes(y = scale_linear_piece, color = "Limits, Piecewise"), alpha = 0.2) +
scale_color_manual(
name = "Method",
values = c("Limits" = "blue3", "Limits, Piecewise" = "green3"),
labels = c("Limits", "Limits, Piecewise")
) +
labs(
title = "Linear Scaling Results",
x = "Theta",
y = "Scaled Score"
) +
theme_clean() +
theme(legend.position.inside = c(0.05, 0.95),
legend.position = "inside",
legend.justification = c("left", "top")) # Adjust justification
## group scale_min scale_max scale_range
## 1 Total 200 800 600
## 2 0%-20% 200 428 228
## 3 20%-40% 379 481 102
## 4 40%-60% 442 522 80
## 5 60%-80% 488 567 79
## 6 80%-100% 548 800 252
That looks a lot more reasonable, with the middle quintiles each spanning around 80 to 100 points. We could take this a step further and set the theta limits even narrower (e.g., c(-3,3)), but we’ll leave it be for now. Let’s try one more linear scaling option.
Another option is a transformation based on the mean and SD of our normative (base sample) scores. Let’s assume we want our transformed scale to have a mean of 500 and a standard deviation of 100.
The formula for this is: \[S = \frac{\sigma(S)}{\sigma(\theta)} \theta + \left[\mu(S) - \frac{\sigma(S)}{\sigma(\theta)} \mu(\theta)\right]\] Where \(\mu(\cdot)\) and \(\sigma(\cdot)\) denote the mean and standard deviation of the scaled scores \(S\) and the ability estimates \(\theta\).
So let’s implement this and take a look at how it compares to our scaling based on scale limits.
sd.S <- 100
sd.Theta <- sd(scaled.scores$ability_est)
mean.S <- 500
mean.Theta <- mean(scaled.scores$ability_est)
scaled.scores <- scaled.scores %>%
mutate(scale_linear_MS = (sd.S / sd.Theta) * ability_est + (mean.S - (sd.S / sd.Theta) * mean.Theta))
scaled.scores %>%
ggplot(aes(x = ability_est)) +
geom_point(aes(y = scale_linear, color = "Limits"), alpha = 0.2) +
geom_point(aes(y = scale_linear_piece, color = "Limits, Piecewise"), alpha = 0.2) +
geom_point(aes(y = scale_linear_MS, color = "Distribution"), alpha = 0.2) +
scale_color_manual(
name = "Method",
values = c("Limits" = "blue3", "Limits, Piecewise" = "green3", "Distribution" = "orange3"),
labels = c("Distribution", "Limits", "Limits, Piecewise")
) +
labs(
title = "Linear Scaling Results",
x = "Theta",
y = "Scaled Score"
) +
theme_clean() +
theme(legend.position.inside = c(0.95, 0.05),
legend.position = "inside",
legend.justification = c("right", "bottom")) # Adjust justification
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 2000 500 100 508 506 80 -269 830 1099 -1.78 9.33 2.24
As we can see, while this new scale spreads out the scaled scores more, it also extends well below the 200 point scale limit we identified earlier. Perhaps we’re comfortable offering scores of -269 to test-takers, but probably not. Let’s take a look at our quintiles.
## group scale_min scale_max scale_range
## 1 Total -269 830 1099.2
## 2 0%-20% -269 458 727.9
## 3 20%-40% 416 505 88.8
## 4 40%-60% 471 541 69.7
## 5 60%-80% 512 580 68.5
## 6 80%-100% 564 830 266.1
Again, the range for the bottom 20% is enormous (728), and that of our middle group is still comparatively narrow.
The linear piecewise scaling approach seems to give us the best spread for the middle quintiles, though the downside is that we artificially limit (remove) variability beyond the bounds we set (e.g., everyone below -4 logits receives the same scaled score).
Now let’s consider one non-linear transformation and see how it compares to the linear scaling.
There are many different non-linear approaches we could take, but let’s take a logistic approach. Here’s one such transformation we could use:
\[S = \frac{S_{\text{max}} - S_{\text{min}}}{1 + e^{-a(\theta - \theta_0)}} + S_{\text{min}}\]
Where \(\theta_0\) is the midpoint of the latent ability scale (we’ll use the median), and \(a\) is a parameter that controls the steepness of the logistic curve. Let’s just set \(a = 1\).
# Median of the theta
mid.Theta <- median(scaled.scores$ability_est)
# Add a scaled score based on logistic
scaled.scores <- scaled.scores %>%
mutate(scale_nl_logistic = (800 - 200)/(1 + exp(-1 *(ability_est - mid.Theta))) + 200)
scaled.scores %>%
ggplot(aes(x = ability_est)) +
geom_point(aes(y = scale_linear, color = "L: Limits"), alpha = 0.2) +
geom_point(aes(y = scale_linear_piece, color = "L: Limits, Piecewise"), alpha = 0.2) +
geom_point(aes(y = scale_linear_MS, color = "L: Distribution"), alpha = 0.2) +
geom_point(aes(y = scale_nl_logistic, color = "NL: Logistic"), alpha = 0.2) +
scale_color_manual(
name = "Method",
values = c("L: Limits" = "blue3", "L: Limits, Piecewise" = "green3", "L: Distribution" = "orange3",
"NL: Logistic" = "red3"),
labels = c("L: Distribution", "L: Limits", "L: Limits, Piecewise",
"NL: Logistic")
) +
labs(
title = "Scaling Results",
x = "Theta",
y = "Scaled Score"
) +
theme_clean() +
theme(legend.position.inside = c(0.95, 0.05),
legend.position = "inside",
legend.justification = c("right", "bottom")) # Adjust justification
## group scale_min scale_max scale_range
## 1 Total 200 796 596
## 2 0%-20% 200 391 191
## 3 20%-40% 318 493 175
## 4 40%-60% 417 573 156
## 5 60%-80% 507 650 143
## 6 80%-100% 620 796 176
That gives us a much more even distribution of scale ranges for each of our quintiles. However, we should note some downsides to using non-linear transformations of scaled scores, especially when it comes to tracking progress over time.
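To make one downside concrete: under a non-linear transform, the same logit gain translates into different scaled-score gains depending on where a student starts. Here's a quick sketch using the logistic transform above (assuming, for illustration only, a midpoint of 0):

```r
# Logistic scaled score as defined above, with a = 1 and an assumed midpoint of 0
logistic_scale <- function(theta, mid = 0) {
  (800 - 200) / (1 + exp(-(theta - mid))) + 200
}
# The same 0.5-logit gain for a student near the middle of the scale...
gain_mid  <- logistic_scale(0.5) - logistic_scale(0.0)
# ...versus a high-ability student
gain_high <- logistic_scale(3.0) - logistic_scale(2.5)
round(c(middle = gain_mid, high = gain_high))  # roughly 73 vs 17 scaled points
```

An equal amount of true growth thus appears much smaller for students at the extremes, which complicates comparing gains across the ability range.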
Given the drawbacks of using a non-linear scaling method, the piecewise method is probably the most appropriate for our situation.
Now that we’ve seen Vertical Scaling in action with 2 groups, let’s move into the next example in which we use Vertical Scaling across 6 grades and simulate longitudinal data for each timepoint.
TK07 / Tong - Kolen (2007): This dataset includes
unidimensional three parameter logistic model (3PL) item parameter
estimates from a simulated dataset based on items from the 1992 Iowa Test
of Basic Skills vocabulary test. The data are for six groups (grades
3-8) with varying numbers of common items between adjacent grades. We
can load the TK07 dataset from plink.
Let’s load the dataset, and see how many items we’re working with.
## grade3 grade4 grade5 grade6 grade7 grade8
## 26 34 37 40 41 43
## g3.g4 g4.g5 g5.g6 g6.g7 g7.g8
## 13 21 16 24 17
So we’ve got one form for each of 6 grades (g3-g8), with adjacent forms having common items. For instance, this output shows there are 26 items in g3 and 34 items in g4, and those two grades have 13 items in common.
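Under the hood, each Mean/Sigma link is a linear transformation, and chaining grades just composes those transformations. Here's a minimal sketch with hypothetical linking constants (the A/B values below are made up for illustration, not estimated from TK07):

```r
# Hypothetical linking constants for two adjacent links
A.g4g3 <- 0.95; B.g4g3 <- -0.40  # places Grade 4 thetas on the Grade 3 scale
A.g5g4 <- 0.90; B.g5g4 <- -0.35  # places Grade 5 thetas on the Grade 4 scale

# Carry a Grade 5 theta to the base (Grade 3) scale by applying both links in turn
theta.g5    <- 1.2
theta.on.g4 <- A.g5g4 * theta.g5    + B.g5g4
theta.on.g3 <- A.g4g3 * theta.on.g4 + B.g4g3

# The chain collapses into a single linear transformation
A.chain <- A.g4g3 * A.g5g4
B.chain <- A.g4g3 * B.g5g4 + B.g4g3
all.equal(theta.on.g3, A.chain * theta.g5 + B.chain)  # TRUE
```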
Let’s perform vertical scaling using this data. Here are the settings we’ll use for this example:
1. We’ll rescale using the Mean/Sigma method, "MS".
2. The base group this time will be group 1 (Grade 3).
3. We’ll use the scaling constant of 1.7, since these items were also calibrated using BILOG-MG.
4. We won’t throw out any common items this time.
# Get the number of items in each form
tk07.n.items <- list(nrow(TK07$pars$grade3),
nrow(TK07$pars$grade4),
nrow(TK07$pars$grade5),
nrow(TK07$pars$grade6),
nrow(TK07$pars$grade7),
nrow(TK07$pars$grade8))
# Create irt.pars object with six groups (all dichotomous items)
tk07.pm <- lapply(tk07.n.items, as.poly.mod)
# Number of response categories for each item in each dataset (all dichotomous)
tk07.cat <- lapply(tk07.n.items, function(x) rep(2, x))
tk07.irt.pars <- as.irt.pars(# Turn values into an `irt.pars` object
x = TK07$pars, # List of item parameter matrices, one per group
common = TK07$common, # Matrix of common items between the groups
cat = tk07.cat, # Number of response categories (2) for each item in the dataset
poly.mod = tk07.pm, # Lists the poly.mod objects that defines the models for each item
grp.names = paste0("grade",3:8)
)
# rescale the item parameters using the Mean/Sigma linking constants,
tk07.out <- plink(tk07.irt.pars,
rescale = "MS", # Transform parameters to the base scale, using Mean/Sigma (Mean/SD) method
base.grp = 1, # The base scale is Group 1 (default; Grade 3)
D = 1.7) # Scaling constant to transform the logistic model; D=1.7 ~~ normal ogive model
tk07.pars.out <- link.pars(tk07.out)
We won’t examine the scaling results this time, since we’ll have 10 tables describing the linear scaling results for all adjacent grade pairs.
Now, let’s simulate data as if we are tracking the same group of 1000 students longitudinally.
To simulate longitudinal data, we’ll start by estimating g3 ability,
apply a base growth rate, then mean center the abilities for each grade.
Since we’re working with 6 groups, we’ll put these abilities in a
dataframe tk07.true.abilities to help us keep track of
them.
# 1. Simulate ability levels
n_test_takers <- 1000
set.seed(012)
# Simulate initial abilities
tk07_initial_abilities <- rnorm(n_test_takers)
# Select a growth rate, and adjust for variability by starting ability (lower ability students grow a little less rapidly, and higher ability students grow a little more rapidly)
base_growth_rate <- 0.3 # Average growth per time period
# Create abilities dataframe
tk07.true.abilities <- data.frame(g3 = tk07_initial_abilities) %>%
# Add simulated growth for each subsequent grade
mutate(
# Simulate growth; multiplying by 1.05 adds an extra "0.05 * gX", so higher-ability students make greater gains.
g4 = base_growth_rate + g3 * 1.05,
g5 = base_growth_rate + g4 * 1.05,
g6 = base_growth_rate + g5 * 1.05,
g7 = base_growth_rate + g6 * 1.05,
g8 = base_growth_rate + g7 * 1.05,
# Then re-center those scores about zero, since we're simulating response patterns for separate tests with means of about 0 anyway. When we simulate response patterns and apply scaling, the growth will be evident.
g4 = g4 - mean(g4),
g5 = g5 - mean(g5),
g6 = g6 - mean(g6),
g7 = g7 - mean(g7),
g8 = g8 - mean(g8),
student = 1:n())
# And visualize the growth
tk07.true.abilities %>%
pivot_longer(cols = c(g3, g4, g5, g6, g7, g8),
names_to = "grade",
values_to = "ability") %>%
arrange(ability) %>%
ggplot(., aes(x = grade, y = ability, group = student, color = as.factor(student))) +
geom_line(alpha = 0.2) +
theme_clean() +
theme(legend.position = "none")
## vars n mean sd median trimmed mad min max range skew
## g3 1 1000 -0.0264 0.959 -0.0412 -0.031 0.891 -3.05 3.11 6.15 0.0233
## g4 1 1000 1.06e-16 1.01 -0.0155 -0.00478 0.935 -3.17 3.29 6.46 0.0233
## g5 1 1000 5.8e-17 1.06 -0.0163 -0.00502 0.982 -3.33 3.45 6.78 0.0233
## g6 1 1000 -4.23e-17 1.11 -0.0171 -0.00527 1.03 -3.5 3.63 7.12 0.0233
## g7 1 1000 1.17e-16 1.17 -0.018 -0.00553 1.08 -3.67 3.81 7.48 0.0233
## g8 1 1000 -1.5e-16 1.22 -0.0189 -0.00581 1.14 -3.85 4 7.85 0.0233
## kurtosis se
## g3 0.103 0.0303
## g4 0.103 0.0319
## g5 0.103 0.0334
## g6 0.103 0.0351
## g7 0.103 0.0369
## g8 0.103 0.0387
Here are the simulated abilities for our sample of sims over time. As you can see, the growth trajectories are just straight lines (given the constant growth rate), which is fine, since our process for simulating response patterns will introduce some variability. Right now, each timepoint has a mean of about 0, yet the ability variability increases slightly over time.
Now let’s generate response patterns for each timepoint, using the same method as before, based on the original item parameters.
set.seed(678) # for rbinom
# Generate responses for all forms
tk07.responses <- list(
"responses.3" = generate_responses(TK07$pars$grade3, tk07.true.abilities$g3),
"responses.4" = generate_responses(TK07$pars$grade4, tk07.true.abilities$g4),
"responses.5" = generate_responses(TK07$pars$grade5, tk07.true.abilities$g5),
"responses.6" = generate_responses(TK07$pars$grade6, tk07.true.abilities$g6),
"responses.7" = generate_responses(TK07$pars$grade7, tk07.true.abilities$g7),
"responses.8" = generate_responses(TK07$pars$grade8, tk07.true.abilities$g8)
)
lapply(tk07.responses, function(x) {psych::describe(rowSums(x))}) %>%
bind_rows %>%
as_tibble() %>%
mutate(timepoint = c("g3","g4","g5","g6","g7","g8")) %>%
relocate(timepoint)
## # A tibble: 6 × 14
## timepoint vars n mean sd median trimmed mad min max range
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 g3 1 1000 16.8 4.49 17 17.0 4.45 4 26 22
## 2 g4 1 1000 20.9 6.22 21 21.1 7.41 4 34 30
## 3 g5 1 1000 21.9 7.38 22 21.9 8.90 5 37 32
## 4 g6 1 1000 23.3 9.16 24 23.4 11.9 4 40 36
## 5 g7 1 1000 23.6 9.58 24 23.6 11.9 4 41 37
## 6 g8 1 1000 23.9 10.8 23.5 23.8 14.1 1 43 42
## # ℹ 3 more variables: skew <dbl>, kurtosis <dbl>, se <dbl>
Again, take these results with a grain of salt. But one thing of note is that, just like our latent abilities, the SD of raw scores increases over time (as intended).
Now let’s estimate the abilities, standard error, and percentile rank for our test-takers.
tk07.abilities <- list(
"g3" = est_abilities_parallel(tk07.responses$responses.3 %>% data.frame,
as = tk07.pars.out$grade3[,1],
bs = tk07.pars.out$grade3[,2],
cs = tk07.pars.out$grade3[,3]
) %>%
mutate(timepoint = "g3",
ability_true = tk07.true.abilities$g3),
"g4" = est_abilities_parallel(tk07.responses$responses.4,
as = tk07.pars.out$grade4[,1],
bs = tk07.pars.out$grade4[,2],
cs = tk07.pars.out$grade4[,3]
) %>%
mutate(timepoint = "g4",
ability_true = tk07.true.abilities$g4),
"g5" = est_abilities_parallel(tk07.responses$responses.5,
as = tk07.pars.out$grade5[,1],
bs = tk07.pars.out$grade5[,2],
cs = tk07.pars.out$grade5[,3]
) %>%
mutate(timepoint = "g5",
ability_true = tk07.true.abilities$g5),
"g6" = est_abilities_parallel(tk07.responses$responses.6,
as = tk07.pars.out$grade6[,1],
bs = tk07.pars.out$grade6[,2],
cs = tk07.pars.out$grade6[,3]
) %>%
mutate(timepoint = "g6",
ability_true = tk07.true.abilities$g6),
"g7" = est_abilities_parallel(tk07.responses$responses.7,
as = tk07.pars.out$grade7[,1],
bs = tk07.pars.out$grade7[,2],
cs = tk07.pars.out$grade7[,3]
) %>%
mutate(timepoint = "g7",
ability_true = tk07.true.abilities$g7),
"g8" = est_abilities_parallel(tk07.responses$responses.8,
as = tk07.pars.out$grade8[,1],
bs = tk07.pars.out$grade8[,2],
cs = tk07.pars.out$grade8[,3]
) %>%
mutate(timepoint = "g8",
ability_true = tk07.true.abilities$g8)
) %>% bind_rows
# And the abilities if we used the old parameters
tk07.abilities.old <- list(
"g3" = est_abilities_parallel(tk07.responses$responses.3 %>% data.frame,
as = TK07$pars$grade3[,1],
bs = TK07$pars$grade3[,2],
cs = TK07$pars$grade3[,3]
) %>%
mutate(timepoint = "g3",
ability_true = tk07.true.abilities$g3),
"g4" = est_abilities_parallel(tk07.responses$responses.4,
as = TK07$pars$grade4[,1],
bs = TK07$pars$grade4[,2],
cs = TK07$pars$grade4[,3]
) %>%
mutate(timepoint = "g4",
ability_true = tk07.true.abilities$g4),
"g5" = est_abilities_parallel(tk07.responses$responses.5,
as = TK07$pars$grade5[,1],
bs = TK07$pars$grade5[,2],
cs = TK07$pars$grade5[,3]
) %>%
mutate(timepoint = "g5",
ability_true = tk07.true.abilities$g5),
"g6" = est_abilities_parallel(tk07.responses$responses.6,
as = TK07$pars$grade6[,1],
bs = TK07$pars$grade6[,2],
cs = TK07$pars$grade6[,3]
) %>%
mutate(timepoint = "g6",
ability_true = tk07.true.abilities$g6),
"g7" = est_abilities_parallel(tk07.responses$responses.7,
as = TK07$pars$grade7[,1],
bs = TK07$pars$grade7[,2],
cs = TK07$pars$grade7[,3]
) %>%
mutate(timepoint = "g7",
ability_true = tk07.true.abilities$g7),
"g8" = est_abilities_parallel(tk07.responses$responses.8,
as = TK07$pars$grade8[,1],
bs = TK07$pars$grade8[,2],
cs = TK07$pars$grade8[,3]
) %>%
mutate(timepoint = "g8",
ability_true = tk07.true.abilities$g8)
) %>% bind_rows
# Add Percentile Rank, raw score, and percent scores
tk07.abilities <- tk07.abilities %>%
group_by(timepoint) %>%
mutate(pct_rank = percent_rank(ability_est)) %>%
ungroup %>%
mutate(pct_rank_overall = percent_rank(ability_est),
student = rep(1:1000,6)) %>%
left_join(lapply(TK07$pars, nrow) %>% # Pull out # of items for each form
unlist %>%
data.frame("n_items" = .) %>%
rownames_to_column("timepoint") %>%
mutate(timepoint = gsub("grade","g",timepoint) %>% as.character,
n_items = as.numeric(n_items)),
by = "timepoint") %>%
mutate(raw_score = unlist(lapply(tk07.responses, rowSums)),
pct_score = raw_score / n_items) %>%
select(-n_items)
tk07.abilities.old <- tk07.abilities.old %>%
group_by(timepoint) %>%
mutate(pct_rank = percent_rank(ability_est)) %>%
ungroup %>%
mutate(pct_rank_overall = percent_rank(ability_est),
student = rep(1:1000,6)) %>%
mutate(raw_score = unlist(lapply(tk07.responses, rowSums)))
# Descriptive Stats
tk07.abilities %>%
group_by(timepoint) %>%
summarise(
n = n(),
mean = mean(ability_est),
sd = sd(ability_est),
min = min(ability_est),
max = max(ability_est),
range = max - min,
median = median(ability_est))
## # A tibble: 6 × 8
## timepoint n mean sd min max range median
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 g3 1000 0.000127 1.25 -11.7 3.29 15.0 0.0174
## 2 g4 1000 0.434 1.45 -12.7 4.34 17.0 0.508
## 3 g5 1000 0.813 1.55 -8.91 4.86 13.8 0.872
## 4 g6 1000 1.13 1.74 -8.55 5.60 14.1 1.26
## 5 g7 1000 1.33 1.63 -7.06 5.45 12.5 1.43
## 6 g8 1000 1.36 2.09 -6.23 5.71 11.9 1.58
These descriptive statistics are now on our vertical scale, and more interpretable. As we can see, our mean (and median) increases over time, as does the SD of scores (for the most part).
Code to create distribution plots…
tk07.raw.mle.plot <- tk07.abilities %>%
ggplot(., aes(x = pct_score, y = ability_est, color = timepoint)) +
facet_wrap(~timepoint) +
geom_point(alpha = 0.3, size = 1) +
labs(title = "Raw Scores v. Scaled MLE Estimated Ability by Timepoint",
x = "Percent Score",
y = "Scaled Ability Estimate",
color = "Timepoint") +
theme_clean() +
theme(legend.position = "none")
tk07.ability.plot <- tk07.abilities %>%
filter(abs(ability_est) <= 6.0) %>%
ggplot(., aes(x = ability_est, color = as.factor(timepoint), fill = as.factor(timepoint))) +
geom_density(alpha = 0.1) + # Overlayed density plots with some transparency
labs(title = "Overlayed Density Plot of Ability Estimates by Group",
x = "Theta",
y = "Density",
color = "Timepoint",
fill = "Timepoint") +
scale_x_continuous(limits = c(-6, 6)) +
scale_y_continuous(limits = c(0, 0.6)) +
theme_clean()
tk07.ability.pct.plot <- tk07.abilities %>%
ggplot(., aes(x = ability_est, y = pct_rank, color = timepoint)) +
geom_line() +
scale_x_continuous(limits = c(-5,5)) +
labs(title = "Ability and Percentile Rank by Timepoint",
x = "Theta",
y = "Percentile Rank",
color = "Timepoint") +
theme_clean() +
theme(legend.position.inside = c(0.05, 0.95),
legend.position = "inside",
legend.justification = c("left", "top")) # Adjust justification
tk07.se.plot <- tk07.abilities %>%
filter(abs(ability_est) <= 6.0) %>%
ggplot(., aes(x = ability_est, y = ability_est_se, color = timepoint)) +
geom_jitter(alpha = 0.2) +
geom_point(alpha = 0.2, size = 1) +
scale_x_continuous(limits = c(-6, 6)) +
scale_y_continuous(limits = c(0, NA)) +
labs(title = "SE of Ability Estimate by Group",
x = "Theta",
y = "Standard Error (SE)",
color = "Timepoint",
fill = "Timepoint") +
theme_clean()
As we can see, there is a similar relationship between Score and Scaled Ability Estimate for all groups. Note that each form has a different number of items, so I’ve used pct_score instead of raw_score for comparability.
Based on the ability estimates from vertical scaling, we can see that the ability distributions shift upward (and become more spread out) across grades.
Now let’s look at the test information and standard error for each grade.
# List out the names of our grades
grades <- c("g3","g4","g5","g6","g7","g8")
# Create test information dataframe for plotting
for(i in 1:length(tk07.pars.out)) {
if(i == 1) {
tk07.ti <- data.frame(thetas = seq(-4,6,.1))
}
temp.ti <- test_info(thetas = seq(-4,6,.1),
a = tk07.pars.out[[i]][,1],
b = tk07.pars.out[[i]][,2],
c = tk07.pars.out[[i]][,3]
) %>%
select(-theta)
colnames(temp.ti) <- c(paste0(grades[i],".info"),paste0(grades[i],".se"))
tk07.ti <- tk07.ti %>%
bind_cols(temp.ti)
}
# For loop to create plots for every grade
for(i in 1:length(grades)) {
if(i == 1) {
plots.ls <- list()
}
# Prep objects & values for ggplot
abilities <- tk07.abilities %>% filter(timepoint == grades[i])
abilities.old <- tk07.abilities.old %>% filter(timepoint == grades[i])
abilities.all <- abilities %>%
mutate(Scale = "Scaled") %>%
bind_rows(abilities.old %>%
mutate(Scale = "Unscaled")
)
ti <- tk07.ti %>% select(thetas, starts_with(grades[i]))
colnames(ti) <- c("theta", "info", "se")
max.info <- max(ti$info)
max.se <- max(ti$se)
max.density <- max(max(density(abilities.all$ability_est[abilities.all$Scale=="Scaled"])$y),
density(abilities.all$ability_est[abilities.all$Scale=="Unscaled"])$y)
# Get colors to match the previous plots
default_colors <- hcl(h = seq(15, 375,
length = length(grades) + 1),
l = 65,
c = 100)[1:length(grades)]
# Create plot
plots.ls[[i]] <- ggplot() +
# Test information function
geom_line(data = ti, aes(x = theta, y = info),
color = "purple", linewidth = 0.5) +
# Standard error
geom_line(data = ti, aes(x = theta, y = se),
color = "orange3", linewidth = 0.5, linetype = "dotted") +
# Density plot
geom_density(data = abilities.all,
aes(x = ability_est,
y = after_stat(density) * max.info / max.density,
fill = Scale,
color = Scale),
alpha = 0.1) +
# Set colors for fill and color
scale_fill_manual(values = c("Scaled" = default_colors[i], "Unscaled" = "grey")) +
scale_color_manual(values = c("Scaled" = default_colors[i], "Unscaled" = "grey")) +
labs(title = paste0(toupper(grades[i]), " Ability Estimate Density Plot with Test Info and SE"),
x = "Theta",
y = "Test Information / Standard Error") +
theme_clean() +
scale_x_continuous(limits = c(-6, 6)) +
# Set y-axis limits for Test Information and SE
scale_y_continuous(
limits = c(0, 20),
name = "Test Information / Standard Error",
sec.axis = sec_axis(~ . * max.density / max.info,
name = "Density of Ability Estimates")
) +
theme(plot.title = element_text(size = 12), # Adjust title size
legend.position.inside = c(0.05, 0.95),
legend.position = "inside",
legend.justification = c("left", "top") # Adjust justification
)
}
Let’s apply the piecewise linear scaling method as before. But first, let’s take a look at the distribution of our ability estimates:
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 6000 0.84 1.71 0.9 0.91 1.36 -12.7 5.71 18.4 -1.35 6.6 0.02
Since we have a broader range, let’s select more relaxed limits for our theta scale: c(-5, 5). Here’s a plot of the scaling for all grades.
A.piece <- (800 - 200) / (5 - (-5))
B.piece <- 200 - A.piece * (-5)
tk07.scaled.scores <- tk07.abilities %>%
mutate(scale_linear_piece = ceiling(A.piece * ability_est + B.piece),
scale_linear_piece = ifelse(scale_linear_piece < 200, 200,
ifelse(scale_linear_piece > 800, 800, scale_linear_piece)))
tk07.scaled.scores %>%
ggplot(., aes(x = ability_est, y = scale_linear_piece, group = timepoint, color = timepoint)) +
geom_jitter(alpha = 0.2) +
# scale_color_manual(name = "Grade") +
labs(
title = "Linear Scaling Results",
x = "Theta",
y = "Scaled Score",
color = "Timepoint",
group = "Timepoint"
) +
theme_clean() +
theme(legend.position.inside = c(0.05, 0.95), # Position in the top left corner
legend.position = "inside",
legend.justification = c("left", "top")) # Adjust justification
Looks like we have more students with extreme values that received the scale minimum of 200, and a few that received the maximum of 800. Let’s see how many min and max scores we have by grade.
tk07.scaled.scores %>%
group_by(timepoint) %>%
summarise(n_200_score = sum(scale_linear_piece == 200),
n_800_score = sum(scale_linear_piece == 800))
## # A tibble: 6 × 3
## timepoint n_200_score n_800_score
## <chr> <int> <int>
## 1 g3 3 0
## 2 g4 5 0
## 3 g5 10 0
## 4 g6 18 6
## 5 g7 13 4
## 6 g8 36 10
As somewhat expected, we have more 800 scores in the higher grades, but interestingly we also have more minimum scores in the higher grades. This is likely a result of greater variability at higher grades, and perhaps of the way we simulated response data, rather than of the vertical scaling itself.
Let’s take a look at our scale range by quintile for the piecewise linear approach:
## # A tibble: 6 × 4
## group scale_min scale_max scale_range
## <chr> <dbl> <dbl> <dbl>
## 1 Total 200 800 600
## 2 0%-20% 200 519 319
## 3 20%-40% 448 575 127
## 4 40%-60% 484 616 132
## 5 60%-80% 518 672 154
## 6 80%-100% 555 800 245
Since we have (simulated) longitudinal data, we may also want to perform growth modeling of the data. Here’s an example of what that could look like.
First, let’s visualize student growth over time using the scaled scores.
tk07.scaled.scores %>%
mutate(student = rep(1:1000,6)) %>%
ggplot(., aes(x = timepoint, y = scale_linear_piece, group = student, color = as.factor(student))) +
geom_line(alpha = 0.2) +
labs(
title = "Scaled Scores Over Time",
x = "Timepoint",
y = "Scaled Score"
) +
theme_clean() +
theme(legend.position = "none")
We have an upward trend, which is good; we would expect (and hope) that students improve in ability over time.
Let’s create a linear mixed model of student growth over time. If you’re not familiar with mixed modeling, it basically means that we have both fixed and random effects.
Fixed effects are effects that apply to EVERYONE
in the model. If you’ve performed simple linear regression before, all
parameters for the predictors in the model are fixed
effects. In our case, the fixed effects would be an intercept (probably
around 500 or so), and a slightly positive fixed effect for
timepoint (scores increase over time).
Random effects account for individual variability at a particular level of the model. In other words, certain parameters are allowed to vary between cases. In our example, the outcome is Scaled Score, the random effects apply to students, and we could have: random intercepts (each student has their own starting level) and/or random slopes (each student has their own growth rate).
Let’s estimate (1) a simple linear regression, then (2) a model with random intercepts, and then (3) a model with random slopes and intercepts, and compare them.
Specifically, our model will predict Scaled Score from Timepoint (grade).
Note: we’ll recode the timepoints so g3 = 0, g4 = 1, and so on.
Let’s use the nlme::gls() function to estimate a simple
linear regression.
tk07.scaled.scores.modeling <- tk07.scaled.scores %>%
mutate(timepoint = as.numeric(gsub("g","",timepoint)) - 3) # Set timepoint g3 = 0
tk07.growth.model.1 <- gls(data = tk07.scaled.scores.modeling,
scale_linear_piece ~ timepoint,
method = "ML")
summary(tk07.growth.model.1)
## Generalized least squares fit by maximum likelihood
## Model: scale_linear_piece ~ timepoint
## Data: tk07.scaled.scores.modeling
## AIC BIC logLik
## 71148 71168 -35571
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) 511 2.080 245.4 0
## timepoint 17 0.687 24.4 0
##
## Correlation:
## (Intr)
## timepoint -0.826
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -4.3415 -0.4974 0.0511 0.5837 2.7272
##
## Residual standard error: 90.9
## Degrees of freedom: 6000 total; 5998 residual
Here’s how to interpret some of the output:
(Intercept) is the estimate at timepoint 0 (g3), where the average Scaled Score is 511. This parameter is significantly different from 0 (p-value = 0).
timepoint is the slope for our predictor “timepoint”. It’s positive (Value = 17) and significantly different from 0, so there is growth: with each increase (or decrease) of 1 grade from the intercept, the Scaled Score is estimated to increase (or decrease) by 17 points.
Next, let’s see if allowing intercepts to vary across students improves our model.
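As a quick sanity check, the fixed effects imply a predicted mean score of 511 + 17 × t at timepoint t. A minimal sketch, hard-coding the rounded estimates from the summary above rather than extracting them from the fitted model:

```r
# Predicted mean Scaled Score from the fixed effects alone,
# using the rounded estimates reported in the summary above.
intercept <- 511  # mean score at timepoint 0 (grade 3)
slope     <- 17   # estimated gain per grade

timepoints <- 0:5                    # grades 3 through 8
predicted  <- intercept + slope * timepoints
setNames(predicted, paste0("g", 3:8))
# g3 = 511, ..., g8 = 511 + 17 * 5 = 596
```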
tk07.growth.model.2 <- lme(fixed = scale_linear_piece ~ timepoint,
random = ~ 1 | student,
data = tk07.scaled.scores.modeling,
method = "ML")
summary(tk07.growth.model.2)
## Linear mixed-effects model fit by maximum likelihood
## Data: tk07.scaled.scores.modeling
## AIC BIC logLik
## 65702 65729 -32847
##
## Random effects:
## Formula: ~1 | student
## (Intercept) Residual
## StdDev: 78.9 45.1
##
## Fixed effects: scale_linear_piece ~ timepoint
## Value Std.Error DF t-value p-value
## (Intercept) 511 2.701 4999 189.0 0
## timepoint 17 0.341 4999 49.3 0
## Correlation:
## (Intr)
## timepoint -0.316
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -6.7203 -0.3515 0.0606 0.4561 3.7962
##
## Number of Observations: 6000
## Number of Groups: 1000
## Model df AIC BIC logLik Test L.Ratio p-value
## tk07.growth.model.1 1 3 71148 71168 -35571
## tk07.growth.model.2 2 4 65702 65729 -32847 1 vs 2 5448 <.0001
This output indicates that our Random Intercept (under
Random effects:) has a standard deviation of 78.9, and
the remaining unexplained (residual) standard deviation is 45.1. In
variance terms, the random intercept accounts for
78.9² / (78.9² + 45.1²) ≈ 75% of the total variance; this ratio is the
intraclass correlation.
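That proportion (the intraclass correlation) can be computed directly from the two standard deviations reported above; a small sketch using the rounded values:

```r
# Intraclass correlation (ICC): the share of total variance attributable
# to between-student differences, from the reported StdDev values.
sd_intercept <- 78.9  # random-intercept standard deviation
sd_residual  <- 45.1  # residual standard deviation

icc <- sd_intercept^2 / (sd_intercept^2 + sd_residual^2)
round(icc, 3)
# roughly 0.75: about three-quarters of the variance is between students
```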
Furthermore, the last table is an ANOVA comparing our Fixed and Random Intercept models, indicating that the random intercept model is significantly better than our Fixed model.
Now let’s see if allowing both intercepts and slopes to vary across students improves our model.
tk07.growth.model.3 <- lme(fixed = scale_linear_piece ~ timepoint,
random = ~ timepoint | student,
data = tk07.scaled.scores.modeling,
method = "ML")
summary(tk07.growth.model.3)
## Linear mixed-effects model fit by maximum likelihood
## Data: tk07.scaled.scores.modeling
## AIC BIC logLik
## 65086 65127 -32537
##
## Random effects:
## Formula: ~timepoint | student
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 58.76 (Intr)
## timepoint 8.92 0.886
## Residual 41.88
##
## Fixed effects: scale_linear_piece ~ timepoint
## Value Std.Error DF t-value p-value
## (Intercept) 511 2.091 4999 244.2 0
## timepoint 17 0.424 4999 39.6 0
## Correlation:
## (Intr)
## timepoint 0.241
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -6.1165 -0.3756 0.0175 0.4280 4.3806
##
## Number of Observations: 6000
## Number of Groups: 1000
## Model df AIC BIC logLik Test L.Ratio p-value
## tk07.growth.model.1 1 3 71148 71168 -35571
## tk07.growth.model.2 2 4 65702 65729 -32847 1 vs 2 5448 <.0001
## tk07.growth.model.3 3 6 65086 65127 -32537 2 vs 3 619 <.0001
The ANOVA comparing all of our models indicates that our last model, with Random Slopes and Intercepts, is significantly better than our Random Intercepts model, which is significantly better than our Fixed model.
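The L.Ratio column in that table is just twice the difference in log-likelihoods between nested models; a quick sketch verifying the reported values from the summaries above:

```r
# Likelihood-ratio statistics recomputed from the reported logLik values.
# L.Ratio = 2 * (logLik_complex - logLik_simple)
loglik <- c(m1 = -35571, m2 = -32847, m3 = -32537)

lr_12 <- unname(2 * (loglik["m2"] - loglik["m1"]))  # model 1 vs model 2
lr_23 <- unname(2 * (loglik["m3"] - loglik["m2"]))  # model 2 vs model 3
c(lr_12, lr_23)
# 5448 and 620, matching the table (619 vs 620 is just rounding)
```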
Finally, let’s visualize the growth model. Note the random effects
from the lme() models are effects beyond the fixed effects
(so an individual’s actual intercept is fixed_intercept +
random_intercept; same w/ the slope.)
fixed_intercept <- tk07.growth.model.3$coefficients$fixed[1]
fixed_slope <- tk07.growth.model.3$coefficients$fixed[2]
tk07.growth.predictions <- tk07.growth.model.3$coefficients$random %>%
data.frame() %>%
rename_with( ~ c("random_intercept","random_slope")) %>%
mutate(g3 = fixed_intercept + random_intercept + 0 * (fixed_slope + random_slope),
g4 = fixed_intercept + random_intercept + 1 * (fixed_slope + random_slope),
g5 = fixed_intercept + random_intercept + 2 * (fixed_slope + random_slope),
g6 = fixed_intercept + random_intercept + 3 * (fixed_slope + random_slope),
g7 = fixed_intercept + random_intercept + 4 * (fixed_slope + random_slope),
g8 = fixed_intercept + random_intercept + 5 * (fixed_slope + random_slope)) %>%
pivot_longer(cols = c(g3, g4, g5, g6, g7, g8),
names_to = "timepoint",
values_to = "predicted_score") %>%
select(timepoint, predicted_score) %>%
mutate(student = sort(rep(1:1000,6)))
tk07.growth.predictions %>%
ggplot(., aes(x = timepoint, y = predicted_score, group = student, color = as.factor(student))) +
geom_line(alpha = 0.2) +
labs(
title = "Model-Estimated Growth Trajectories",
x = "Timepoint",
y = "Predicted Scaled Score"
) +
theme_clean() +
theme(legend.position = "none")
This visualization presents the ability trajectories we would expect from our simulated data, and they are similar to the trajectories shown in the True (Latent) Ability section: a general increase in scores for most students, but with increasing variability over time.
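As a side note, instead of adding the fixed and random effects by hand, nlme’s coef() method on an lme fit returns each group’s combined (fixed + random) intercept and slope, which can simplify the prediction step. A hedged sketch on simulated data (the data and variable names below are illustrative, not from this project):

```r
library(nlme)

set.seed(1)
# Simulate simple longitudinal data: 50 students, 6 timepoints each
n_students <- 50
d <- data.frame(
  student   = rep(1:n_students, each = 6),
  timepoint = rep(0:5, times = n_students)
)
d$score <- 500 +
  rep(rnorm(n_students, 0, 40), each = 6) +                       # random intercepts
  (15 + rep(rnorm(n_students, 0, 5), each = 6)) * d$timepoint +   # random slopes
  rnorm(nrow(d), 0, 20)                                           # residual noise

fit <- lme(score ~ timepoint, random = ~ timepoint | student,
           data = d, method = "ML")

# coef() returns fixed + random effects per student, so predictions are
# just intercept + slope * timepoint without any manual addition
head(coef(fit), 3)
```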
This vertical scaling project demonstrates how to implement and evaluate vertical scaling techniques for educational assessments. The document thoroughly covers the application of the plink package to conduct IRT-based common-item vertical scaling across multiple groups and timepoints, illustrating the process with two main examples: a simpler two-group scenario and a more complex six-grade longitudinal study.
Key achievements of this project include:
Overall, this document serves as a comprehensive guide to vertical scaling within the context of educational measurement, offering valuable insights and methodologies that can be applied to similar assessment scenarios. It not only enhances understanding of vertical scaling but also demonstrates the practical application of theoretical concepts using R, making it a valuable resource for both educational researchers and psychometricians.